| Type of variable | Official name | R data type |
|---|---|---|
| Continuous number | numeric | numeric |
| Integer | numeric | integer |
| Label | categorical | factor |
Systems Biology Lab
2024-11-01
| Predictor variables | Response variable | Type of model |
|---|---|---|
| categorical, numeric | categorical | classifier |
| categorical, numeric | numeric | regression |
One or more of the predictor variables must carry information about the response variable
The functions \(f, g, \ldots\) can be from the framework of
Goal: Predicting income from Seniority and Years of Education
The best model is the one that fits the training data best
\[y = \beta_0\]
The least-squares fit of this model minimizes the sum of squared differences:
\[ S(\beta_0) = \sum_{i=1}^n (y_i - \beta_0)^2 \]
It is obtained when \(\beta_0\) equals the average (sample mean) of the measurements \(y_i\):
\[ \beta_0 = \frac{1}{n} \sum_{i=1}^n y_i = \overline{y} \]
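This can be checked numerically: minimizing \(S(\beta_0)\) over a grid of candidate values lands on the sample mean. A minimal sketch with a hypothetical data vector:

```r
# Hypothetical data vector; any numeric vector works
y <- c(2.1, 3.5, 1.8, 4.2, 2.9)

# S(beta0): sum of squared differences for a candidate beta0
S <- function(beta0) sum((y - beta0)^2)

# Numerical minimization over an interval containing the data
fit <- optimize(S, interval = range(y))

fit$minimum   # very close to mean(y)
mean(y)       # the analytical least-squares solution
```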
\(y=\overline{y}\) will be called the Null Model or M0
The Residual Sum of Squares (RSS) of the Null Model is:
\[ \text{RSS} = \sum_{i=1}^n (y_i - \overline{y})^2 \]
The RSS of the Null Model is also called the Total Sum of Squares (TSS) of the data set \(y_i\)
The Mean Square Error (MSE) of the Null Model is:
\[ \text{MSE} \equiv \frac{\text{RSS}}{n} = \frac{1}{n} \sum \left(y_i - \overline{y} \right)^2 \]
Do you recognize the right hand side?
It is almost the same as the sample variance \(s^2(y_i) = \frac{1}{n-1} \sum \left(y_i - \overline{y} \right)^2\)
Model M0: \(y=\beta_0\)
\(\text{RSS} \, (\equiv \text{TSS}) = 10^{4}\)
Model M1: \(y=\beta_0 + \beta_1 x\)
\(\text{RSS} = 800\)
The Total Sum of Squares (TSS) of a response data set \(y_i\) was defined as
\[ \text{TSS} = \sum_{i=1}^n \left( y_i- \overline{y} \right)^2 \]
The Residual Sum of Squares (RSS) of a model \(y=f(x_1,x_2, \ldots)\) is
\[ \text{RSS} = \sum_{i=1}^n \left( y_i- f(x_{i1},x_{i2}, \ldots) \right)^2 \]
The variance explained is defined as \(R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\)
What are the maximal and minimal possible values of \(R^2\)? (Think carefully!)
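The identity \(R^2 = 1 - \text{RSS}/\text{TSS}\) is exactly what `lm()` reports as "Multiple R-squared". A minimal sketch with simulated data (the coefficients 2 and 0.5 are arbitrary choices for illustration):

```r
set.seed(1)
# Simulated data with an assumed linear relationship
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

fit <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)          # RSS of the Null Model
RSS <- sum(residuals(fit)^2)         # RSS of the fitted model
R2  <- 1 - RSS / TSS

R2
summary(fit)$r.squared               # same value, reported by lm()
```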
In a linear model the parameters \(\beta_i\) of the model appear linearly in the model equation: each parameter occurs with power 1 in the model equation. Examples of linear model equations:
\[ \begin{align*}y &= \beta_0 + \beta_1 x \\ y &= \beta_1 x + \beta_2 x^2 \\ y &= \beta_1 x + \beta_2 \frac{1}{x} \\ y &= \beta_0 + \beta_1 e^{x} \end{align*} \]
In contrast, the following model equations are not linear in the parameters:
\[ \begin{align*} y &= \frac{\beta_0 x}{\beta_1 + x} \\ y &= \beta_1 x + \frac{1}{\beta_2 x} \\ y &= \beta_0 + \beta_2 e^{\beta_3 x} \end{align*} \]
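Note that linearity concerns the parameters, not the predictors: a model like \(y = \beta_0 + \beta_1 e^{x}\) is linear and can be fit directly with `lm()` by transforming the predictor in the formula. A minimal sketch with simulated data (the coefficients 1 and 0.5 are arbitrary choices):

```r
set.seed(2)
x <- seq(0, 2, length.out = 30)
# Data generated from y = beta0 + beta1 * exp(x): linear in the parameters
y <- 1 + 0.5 * exp(x) + rnorm(30, sd = 0.1)

# I() protects the transformation inside the model formula
fit <- lm(y ~ I(exp(x)))
coef(fit)   # estimates close to the true values 1 and 0.5
```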
| \(i\) | \(x_i\) | \(y_i\) | \(f(x_i, \vec{\beta})\) | \(\epsilon\) |
|---|---|---|---|---|
| 1 | 0 | 0.1 | 0.2 | -0.1 |
| 2 | 1 | 2.4 | 2.2 | 0.2 |
| 3 | 2 | 5.5 | 6.0 | -0.5 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(n\) | 100 | 1.1 | 0.0 | 1.1 |
| \(i\) | \(x_{i1}\) | \(x_{i2}\) | \(\ldots\) | \(x_{im}\) | \(y_i\) | \(f(\vec{x}_{i},\vec{\beta})\) | \(\epsilon\) |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | \(\ldots\) | 0 | 0.1 | 0.11 | -0.01 |
| 2 | 1 | 1 | \(\ldots\) | 0 | 2.4 | 2.38 | 0.02 |
| 3 | 2 | 1 | \(\ldots\) | 0 | 5.8 | 6.0 | -0.2 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(n\) | 100 | 3 | \(\ldots\) | 5 | -0.1 | 0.0 | -0.1 |
For linear models we assume that residuals are drawn from a normal distribution with mean \(0\) and constant standard deviation \(\sigma\).
\[ y = f(\vec{x}, \vec{\beta}) + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]
\[ \begin{align*} y &= \beta_0 + \beta_1 x + \epsilon &\qquad \epsilon \thicksim \text{Norm}(0,\sigma) \\ y &= \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon &\qquad \epsilon \thicksim \text{Norm}(0,\sigma) \end{align*} \]
\[ \begin{align*} y &\thicksim \text{Norm}(\beta_0 + \beta_1 x, \sigma) \\ y &\thicksim \text{Norm}(\beta_1 x_1 + \beta_2 x_1 x_2, \sigma) \end{align*} \]
Interpretation: \(y\) itself is drawn from a normal distribution whose mean depends on \(x, x_1, x_2\)
Having a linear model equation
\[ y = f(\vec{x},\vec{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]
Having \(n\) measurement tuples \((y_i, x_{ij})\) with \(i=1 \ldots n\) and \(j=1 \ldots m\), the predicted responses \(y'_i = f(\vec{x}_i, \vec{\beta})\) can be written as the matrix equation
\[ \begin{bmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1m} \\ 1 & x_{21} & x_{22} & \ldots & x_{2m} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nm} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix} \]
abbreviated as
\[ \vec{y}' = \boldsymbol{X} \vec{\beta} \]
For model equation
\[ y = f(x,\vec{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]
this becomes
\[ \begin{bmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \]
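In R, the design matrix \(\boldsymbol{X}\) for a model formula is built by `model.matrix()`. A minimal sketch for the quadratic model above, with an arbitrary small predictor vector:

```r
x <- c(0, 1, 2, 3)

# Design matrix with columns 1, x, x^2, matching the matrix equation above
X <- model.matrix(~ x + I(x^2))
X
```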
Notice that in the expression \(\boldsymbol{X}\vec{\beta}\):
Anything is allowed for the predictor terms:
Example:
\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \] has in general no solution for \(\vec{\beta}\)
\[ \vec{\epsilon} = \vec{y} - \vec{y}' = \vec{y} - \boldsymbol{X} \vec{\beta} \]
Fitting a function to the data means tuning the parameters \(\beta_0, \beta_1, \ldots\) such that the sum of squared errors, the residual sum of squares, is minimized:
\[ \text{RSS} (\vec{\beta}) = \sum_{i=1}^n \epsilon_i^2 = \|\vec{\epsilon}\|^2 = \|\vec{y} - \boldsymbol{X} \vec{\beta}\|^2 \]
The least-squares solution \(\vec{y}'\) is the perpendicular projection of \(\vec{y}\) onto the column space \(\mathcal{C}(\boldsymbol{X})\) of \(\boldsymbol{X}\)
So, \(\vec{\epsilon}\) must be perpendicular to the column space \(\mathcal{C}(\boldsymbol{X})\) of \(\boldsymbol{X}\), or
\[ \boldsymbol{X}^T\vec{\epsilon} = \vec{0} \]
Since \(\vec{\epsilon} = \vec{y} - \boldsymbol{X}\vec{\beta}\), we get
\[\boldsymbol{X}^T \left( \vec{y} - \boldsymbol{X}\vec{\beta} \right) = \vec{0}\] This is the equation to be solved for \(\vec{\beta}\) to obtain the least-squares fit, or \[ \boldsymbol{X}^T \boldsymbol{X} \vec{\beta} = \boldsymbol{X}^T \vec{y} \]
This equation can be solved for \(\vec{\beta}\) by Gaussian elimination
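The normal equations can be solved directly in R and give the same estimates as `lm()`. A minimal sketch with simulated quadratic data (true coefficients 9, 7.5, 0.6 are arbitrary choices):

```r
set.seed(3)
x <- runif(15, 0, 4)
y <- 9 + 7.5 * x + 0.6 * x^2 + rnorm(15, sd = 1)

X <- model.matrix(~ x + I(x^2))          # design matrix
# Solve the normal equations  t(X) X beta = t(X) y
beta <- solve(t(X) %*% X, t(X) %*% y)

cbind(beta, coef(lm(y ~ x + I(x^2))))    # identical estimates
```

In practice `lm()` uses a numerically more stable QR decomposition, but the result is the same solution of the normal equations.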
In R, the least-squares fit is obtained with the lm() function:
Call:
lm(formula = y ~ x + I(x^2), data = d)
Residuals:
Min 1Q Median 3Q Max
-4.0840 -3.1905 -0.2577 1.7967 7.9306
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.89377 2.37222 3.749 0.00243 **
x 7.52765 0.73380 10.258 1.34e-07 ***
I(x^2) 0.59503 0.04719 12.609 1.15e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.567 on 13 degrees of freedom
Multiple R-squared: 0.9982, Adjusted R-squared: 0.998
F-statistic: 3697 on 2 and 13 DF, p-value: < 2.2e-16
lm() when the predictor is categorical!

| \(x\) | \(y\) (cm) |
|---|---|
| M | 173 |
| F | 177 |
| M | 182 |
| \(\vdots\) | \(\vdots\) |
| F | 165 |
Under the hood a dummy variable \(x'\) with values 0 (for F) or 1 (for M) is defined
The line is a fit of the model \(y = \beta_0 + \beta_1 x'\), i.e. to the data with the dummy variable
Model equation: \(y = \beta_0 + \beta_1 x'\)
In the example of body lengths of males and females:
Does the variable gender carry information about the variable height?
Call:
lm(formula = y ~ x, data = d)
Residuals:
Min 1Q Median 3Q Max
-15.858 -4.923 -1.258 5.367 11.642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 165.317 1.967 84.050 < 2e-16 ***
xM 15.042 2.782 5.408 1.98e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.814 on 22 degrees of freedom
Multiple R-squared: 0.5707, Adjusted R-squared: 0.5511
F-statistic: 29.24 on 1 and 22 DF, p-value: 1.975e-05
The value 0 is outside the 95% confidence interval of xM, so we would discard the hypothesis that \(\beta_1=0\) at a rejection level \(\alpha=0.05\)
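With a single binary predictor the fitted coefficients have a direct interpretation: the intercept is the mean of the reference group and the slope is the difference of the group means. A minimal sketch with simulated body lengths (group means 165 and 180 cm are hypothetical choices):

```r
set.seed(4)
# Simulated body lengths (cm) for hypothetical groups F and M
d <- data.frame(x = rep(c("F", "M"), each = 12))
d$y <- ifelse(d$x == "M", 180, 165) + rnorm(24, sd = 7)

fit <- lm(y ~ x, data = d)
coef(fit)

# (Intercept) equals the mean of the reference group "F";
# xM equals mean of group "M" minus mean of group "F"
mean(d$y[d$x == "F"])
mean(d$y[d$x == "M"]) - mean(d$y[d$x == "F"])
```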
Dummy encoding of a categorical variable with more than two levels works the same way: with every additional level a new dimension is added.
In this dummy encoding scheme at most one of the dimensions can have the value 1.
Example of a model equation having a term with such a dummy variable with three levels:
\[y = \beta_0 + \vec{\beta} \vec{x}'\]
in which \(\vec{\beta} = [\beta_{1}\, \beta_{2}]\) contains two parameters, one of which is “selected” depending on the value of the dummy \(\vec{x}'\).
Example:
\[[\beta_{1}\, \beta_{2}]\Bigl[\begin{smallmatrix}0 \\ 1 \end{smallmatrix}\Bigr] = \beta_{2}\]
The dummy variable can assume three values: \([\begin{smallmatrix}0 \\ 0 \end{smallmatrix}]\), \([\begin{smallmatrix}1 \\ 0 \end{smallmatrix}]\), and \([\begin{smallmatrix}0 \\ 1 \end{smallmatrix}]\)
The equivalent of the model \(y = \beta_0 + \vec{\beta} \vec{x}'\) is the set of equations:
\[y = \begin{cases} \beta_0 & \text{if } x =\text{level 1} \text{, or } [\begin{smallmatrix}0 \\ 0 \end{smallmatrix}] \\ \beta_0 + \beta_{1} & \text{if } x =\text{level 2} \text{, or } [\begin{smallmatrix}1 \\ 0 \end{smallmatrix}] \\ \beta_0 + \beta_{2} & \text{if } x =\text{level 3} \text{, or } [\begin{smallmatrix}0 \\ 1 \end{smallmatrix}] \\ \end{cases}\]
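The dummy columns can be inspected with `model.matrix()`: for a three-level factor it produces an intercept column plus two 0/1 columns, exactly the encoding above. A minimal sketch with arbitrary level labels:

```r
x <- factor(c("A", "B", "C", "B"))

# Each row shows the dummy encoding of one observation;
# level "A" (all dummies 0) is the reference level
X <- model.matrix(~ x)
X
```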
| \(x\) | \(y\) |
|---|---|
| A | 98 |
| B | 138 |
| C | 172 |
| \(\vdots\) | \(\vdots\) |
| B | 146 |
The lm() function takes care of the dummy encoding.

(Intercept): the average \(y\)-value of the class “A” samples
xB: average value of class “B” minus average value of class “A” samples
xC: average value of class “C” minus average value of class “A” samples

Class “A” is apparently used as a reference
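This interpretation can be verified directly: refit a three-level example and compare the coefficients to the group means. A minimal sketch with simulated data (class means 100, 140, 170 are hypothetical choices matching the flavor of the table above):

```r
set.seed(5)
d <- data.frame(x = rep(c("A", "B", "C"), each = 8))
d$y <- c(A = 100, B = 140, C = 170)[d$x] + rnorm(24, sd = 5)

fit <- lm(y ~ x, data = d)
coef(fit)

# (Intercept): mean of class "A"
# xB: mean("B") - mean("A");  xC: mean("C") - mean("A")
tapply(d$y, d$x, mean)
```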